Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: MIT License

Python Machine Learning - Code Examples

Chapter 4 - Building Good Training Sets – Data Preprocessing

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).


In [ ]:
%load_ext watermark
%watermark -a "Sebastian Raschka" -u -d -p numpy,pandas,matplotlib,sklearn

The use of watermark is optional. You can install this IPython extension via "pip install watermark". For more information, please see: https://github.com/rasbt/watermark.



Overview




In [ ]:
# Use the IPython/Jupyter feature to show images inline in the notebook
# output rather than having them pop up in a separate window.
from IPython.display import Image
%matplotlib inline

Dealing with missing data

Identifying missing values in tabular data


In [ ]:
# Sample csv

import pandas as pd
from io import StringIO
import sys

csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:

if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df

In [ ]:
# Give a count of null values for each column
df.isnull().sum()

In [ ]:
# access the underlying NumPy array
# via the `values` attribute
df.values



Eliminating samples or features with missing values


In [ ]:
# remove rows that contain missing values

df.dropna(axis=0)

In [ ]:
# remove columns that contain missing values

df.dropna(axis=1)

In [ ]:
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
,,,
4.0,,,
4.0,6.0,,
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:

if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)

df2 = pd.read_csv(StringIO(csv_data))
df2

In [ ]:
# only drop rows where all columns are NaN

df2.dropna(how='all')

In [ ]:
# drop rows that have fewer than 3 non-NaN values

df2.dropna(thresh=3)

In [ ]:
# only drop rows where NaN appears in a specific column (here: 'C')

df2.dropna(subset=['C'])



Imputing missing values

Removing rows or columns with missing values may be fine, but it can reduce the amount of available data too much; imputation lets us keep the samples by filling in estimated values instead.


In [ ]:
# again: our original array
df.values

In [ ]:
# impute missing values via the column mean

from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

In [ ]:
# impute missing values via the row mean

imr = Imputer(missing_values='NaN', strategy='mean', axis=1)
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data
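
Note that Imputer (and its axis parameter) was removed in later scikit-learn releases. If you are running a newer version (0.20 or later), a rough equivalent of the column-mean imputation above is SimpleImputer, which always works column-wise:


In [ ]:
# minimal sketch for newer scikit-learn versions, where
# sklearn.preprocessing.Imputer was replaced by sklearn.impute.SimpleImputer;
# SimpleImputer has no axis parameter and always imputes column-wise
import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)
imputed_data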


Documentation for sklearn.preprocessing.Imputer

Understanding the scikit-learn estimator API


In [ ]:
Image(filename='images/04_01.png', width=400)

Here we use the fit and transform methods of a preprocessor, for example MinMaxScaler, to map data from its original form to one better suited for machine learning.

Later, the fitted transformer is used to transform the test data, and then any new data when it comes along.

As we have seen, it is possible to either remove missing data or impute values. One thing to bear in mind is that the data supplied to the fit and transform methods must always have the same number of features (columns).
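
A minimal sketch of this fit/transform pattern (using MinMaxScaler, which is covered later in this chapter, on small made-up arrays):


In [ ]:
# minimal sketch of the transformer API: fit on the training data only,
# then reuse the learned parameters to transform training and new data
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training data
X_new = np.array([[1.5, 15.0]])                           # toy "test" data

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training data
X_new_scaled = scaler.transform(X_new)    # apply the same min/max to new data
X_new_scaled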


In [ ]:
Image(filename='images/04_02.png', width=300)

This time the fit method of an estimator, for example LogisticRegression, is used with the training data and the training labels to build a model.

This model is then used to predict the class labels of the test data. The output is an array of predicted labels.
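
The corresponding fit/predict pattern, sketched on toy arrays:


In [ ]:
# minimal sketch of the estimator API: fit with training data and labels,
# then predict class labels for previously unseen data
import numpy as np
from sklearn.linear_model import LogisticRegression

X_tr = np.array([[0.0], [1.0], [2.0], [3.0]])  # toy training data
y_tr = np.array([0, 0, 1, 1])                  # toy training labels

model = LogisticRegression()
model.fit(X_tr, y_tr)              # learn the model parameters
model.predict(np.array([[2.5]]))   # output: predicted class label(s)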



Handling categorical data

Nominal and ordinal features


In [ ]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df



Mapping ordinal features

Here we can map the values in place within the DataFrame.


In [ ]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

And we can map these back to the original string values if required.


In [ ]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)



Encoding class labels


In [ ]:
import numpy as np

# create a mapping dict
# to convert class labels from strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

In [ ]:
# class_mapping is the dictionary we pass to the map method
# to convert the class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df

In [ ]:
# reverse the class label mapping
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

In [ ]:
# To avoid doing this by hand we can use the LabelEncoder class
# from the sklearn.preprocessing module

from sklearn.preprocessing import LabelEncoder

# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

In [ ]:
# reverse mapping
class_le.inverse_transform(y)



Performing one-hot encoding on nominal features

Convert categorical variable(s) into dummy/indicator variables


In [ ]:
# Looking at just color, size & price, we can convert the non-numeric
# color values with the LabelEncoder

X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

What is the problem with this approach? The colors are now integers (e.g. blue=0, green=1, red=2), so a learning algorithm will assume an ordering between them that does not exist for a nominal feature. One-hot encoding avoids this by creating a separate binary column for each color value:


In [ ]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()

In [ ]:
# return a dense array so that we can skip
# the toarray step

ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe.fit_transform(X)

In [ ]:
df

In [ ]:
# one-hot encoding via pandas - just color as a nominal value

pd.get_dummies(df[['price', 'color', 'size']])

In [ ]:
# one-hot encoding via pandas - both color and class label as nominal values

pd.get_dummies(df[['price', 'color', 'size','classlabel']])

In [ ]:
# multicollinearity guard in get_dummies

pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)

In [ ]:
# multicollinearity guard in get_dummies
# - both color and class label as nominal values

pd.get_dummies(df[['price', 'color', 'size','classlabel']], drop_first=True)

In [ ]:
X

In [ ]:
# multicollinearity guard for the OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()

In [ ]:
ohe.fit_transform(X).toarray()[:, 1:]
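
Note that the categorical_features argument was removed from OneHotEncoder in later scikit-learn releases. If you are on a newer version (0.20 or later), a rough equivalent of the cells above is to combine OneHotEncoder with a ColumnTransformer:


In [ ]:
# minimal sketch for newer scikit-learn versions, where OneHotEncoder
# no longer accepts categorical_features: one-hot encode column 0 and
# pass the remaining columns through unchanged
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

c_transf = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                             remainder='passthrough')
c_transf.fit_transform(X)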



Partitioning a dataset into separate training and test sets


In [ ]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
                      'ml/machine-learning-databases/wine/wine.data',
                      header=None)

# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:

# df_wine = pd.read_csv('wine.data', header=None)


df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

In [ ]:
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, 
                     test_size=0.3, 
                     random_state=0, 
                     stratify=y)
# X: the feature data
# y: the class labels
# test_size=0.3: 30% of the samples go to the test set, the rest to the training set
# stratify=y: keep the class proportions the same in both splits



Bringing features onto the same scale

Most machine learning algorithms behave much better if features are on the same scale.
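
The two common approaches used below are min-max normalization and standardization. For a value $x^{(i)}$ of a given feature column they are defined as

$$x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}, \qquad x^{(i)}_{std} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$

where $x_{min}$ and $x_{max}$ are the smallest and largest values in the feature column, and $\mu_x$ and $\sigma_x$ are its mean and standard deviation. MinMaxScaler implements the former, StandardScaler the latter.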


In [ ]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

In [ ]:
X[0,:]

In [ ]:
X_train[0,:]

In [ ]:
X_train_norm[0,:]

In [ ]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

In [ ]:
X_train_std[0,:]

A visual example:


In [ ]:
ex = np.array([0, 1, 2, 3, 4, 5])

print('standardized:', (ex - ex.mean()) / ex.std())

# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and StandardScaler
# use ddof=0 (population standard deviation)

# normalize
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))



Selecting meaningful features

If the model we create performs much better on the training dataset than on the test dataset, it is very likely that we are overfitting.

Overfitting means our model does not generalize well and so will not perform well on yet unseen data.

Some options to deal with this:

  • Collect more training data
  • Penalize complexity via regularization
  • Try a simpler model with fewer parameters
  • Dimensionality reduction

Collecting more training data may not be an option, and trying simpler models with fewer parameters may come down to trial and error.

Next we will look at penalizing complexity via regularization, and then at dimensionality reduction via feature selection.

L1 and L2 regularization as penalties against model complexity
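
As a reminder, the two penalty terms that can be added to the cost function are

$$L2: \quad \|\mathbf{w}\|_2^2 = \sum_{j=1}^{m} w_j^2 \qquad\qquad L1: \quad \|\mathbf{w}\|_1 = \sum_{j=1}^{m} |w_j|$$

where $\mathbf{w}$ is the weight vector and $m$ is the number of features. L2 shrinks the weights towards zero, while L1 tends to drive many weights exactly to zero, producing sparse solutions.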

A geometric interpretation of L2 regularization


In [ ]:
Image(filename='images/04_04.png', width=500)

In [ ]:
Image(filename='images/04_05.png', width=500)

Sparse solutions with L1-regularization


In [ ]:
Image(filename='images/04_06.png', width=500)

For regularized models in scikit-learn that support L1 regularization, we can simply set the penalty parameter to 'l1' to obtain a sparse solution:


In [ ]:
from sklearn.linear_model import LogisticRegression
LogisticRegression(penalty='l1')
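
Note: this follows the scikit-learn version used in the book. In newer releases (0.22 and later) the default solver, 'lbfgs', does not support the L1 penalty, so you may need to select a compatible solver explicitly, for example:


In [ ]:
# minimal sketch for newer scikit-learn versions: the 'liblinear' and
# 'saga' solvers support penalty='l1', the default 'lbfgs' does not
LogisticRegression(penalty='l1', solver='liblinear')

The same applies to the LogisticRegression cells that follow in this section.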

Applied to the standardized Wine data ...


In [ ]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=1.0)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy    :', lr.score(X_test_std, y_test))

In [ ]:
lr.intercept_

This shows the intercept of each of the three models (fit via the one-vs-rest approach, one per class label).


In [ ]:
# a NumPy function to set the printed precision to 8 digits
np.set_printoptions(precision=8)

Here we can see the total number of weights that have not been shrunk to zero by L1 regularization, out of the maximum of $13 \text{ features} \times 3 \text{ classes} = 39$.


In [ ]:
lr.coef_[lr.coef_!=0].shape

Here we can see all the weights for the three classes and the 13 features in the Wine dataset.


In [ ]:
lr.coef_

With this information we can now graph how the regularization strength affects the weights.

The default inverse regularization strength in LogisticRegression is C=1.0. We can use a simple loop over values of $C$ from $10^{-4}$ to $10^5$ to collect the weights and then graph them:

for c in np.arange(-4., 6.):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)

In [ ]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.subplot(111)
    
colors = ['blue', 'green', 'red', 'cyan', 
          'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4., 6.):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
#plt.savefig('images/04_07.png', dpi=300, 
#            bbox_inches='tight', pad_inches=0.2)
plt.show()

Below we can see the effect on the training and test accuracy.


In [ ]:
lr = LogisticRegression(penalty='l1', C=0.01)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy    :', lr.score(X_test_std, y_test))

In [ ]:
lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy    :', lr.score(X_test_std, y_test))

In [ ]:
lr = LogisticRegression(penalty='l1', C=1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy    :', lr.score(X_test_std, y_test))

In [ ]:
lr = LogisticRegression(penalty='l1', C=10)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy    :', lr.score(X_test_std, y_test))



Sequential feature selection algorithms

The SBS (sequential backward selection) class below starts with the full feature set and greedily removes one feature at a time, always discarding the feature whose removal costs the least validation accuracy, until only k_features features remain.


In [ ]:
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, 
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        # sequentially remove features until only k_features remain
        while dim > self.k_features:
            scores = []
            subsets = []

            # evaluate every candidate subset with one feature fewer
            # than the current feature set
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train, 
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # keep the best-scoring subset and reduce the dimension by one
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])

        # accuracy of the final (smallest) feature subset
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score

In [ ]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.02])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('images/04_08.png', dpi=300)
plt.show()

In [ ]:
# sbs.subsets_[10] is the 3-feature subset (13 features minus 10 elimination steps)
k3 = list(sbs.subsets_[10])
print(df_wine.columns[1:][k3])

In [ ]:
# sbs.subsets_[7] is the 6-feature subset
k6 = list(sbs.subsets_[7])
print(df_wine.columns[1:][k6])

In [ ]:
knn.fit(X_train_std, y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std, y_train))
print('Test accuracy    : %0.3f' % knn.score(X_test_std, y_test))

In [ ]:
knn.fit(X_train_std[:, k3], y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std[:, k3], y_train))
print('Test accuracy    : %0.3f' % knn.score(X_test_std[:, k3], y_test))

In [ ]:
knn.fit(X_train_std[:, k6], y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std[:, k6], y_train))
print('Test accuracy    : %0.3f' % knn.score(X_test_std[:, k6], y_test))



Assessing feature importance with Random Forests

With scikit-learn's random forest implementation we can assess feature importance.


In [ ]:
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]), 
        importances[indices],
        align='center')

plt.xticks(range(X_train.shape[1]), 
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('images/04_09.png', dpi=300)
plt.show()

This is great for finding discriminative features, with one gotcha: if two or more features are highly correlated, one of them may be ranked highly while the information in the other feature(s) is not fully captured. This is not a problem if model performance is what matters, but it is if interpreting feature importance is.

scikit-learn's SelectFromModel

Scikit-learn implements a SelectFromModel object that selects features based on a user-specified importance threshold after model fitting. Note that the forest fitted above is passed in. Here we set a threshold of 0.1 to keep the top five features.

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)

In [ ]:
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of samples that meet this criterion: %d out of %d in the training set'
      % (X_selected.shape[0], X_train.shape[0]))



In [ ]:
# list the selected features (the top X_selected.shape[1] features by importance)
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))



Summary

  • handle missing data
  • encode categorical variables
  • map ordinal and nominal feature values to integer representations
  • regularization
  • sequential feature selection

See Chapter 5 for feature extraction.

